Abstract. We propose an automatic video inpainting algorithm which relies on the optimisation of a global,
patch-based functional. Our algorithm is able to deal with a variety of challenging situations which 
naturally arise in video inpainting, such as the correct reconstruction of dynamic textures, multiple 
moving objects and moving background. Furthermore, we achieve this with an execution time an order
of magnitude lower than that of the state of the art. We are also able to achieve good quality results
on high definition videos. Finally, we provide specific algorithmic details to make implementation of 
our algorithm as easy as possible. The resulting algorithm requires no segmentation or manual input 
other than the definition of the inpainting mask, and can deal with a wider variety of situations 
than is handled by previous work.

1. Introduction. Advanced image and video editing techniques are increasingly common
in the image processing and computer vision world, and are also starting to be used in media 
entertainment. One common and difficult task closely linked to the world of video editing is 
image and video “inpainting”. Generally speaking, this is the task of replacing the content of 
an image or video with some other content which is visually pleasing. This subject has been 
extensively studied in the case of images, to such an extent that commercial image inpainting 
products destined for the general public are available, such as Photoshop’s “Content Aware 
fill” [1]. However, while some impressive results have been obtained in the case of videos, the 
subject has been studied far less extensively than image inpainting. This relative lack of research
can largely be attributed to high time complexity due to the added temporal dimension.
Indeed, it has only very recently become possible to produce good quality inpainting results on 
high definition videos, and this only in a semi-automatic manner. Nevertheless, high-quality 
video inpainting has many important and useful applications such as film restoration, professional
post-production in cinema and video editing for personal use. For this reason, we
believe that an automatic, generic video inpainting algorithm would be extremely useful for 
both academic and professional communities. 

1.1. Prior work. The generic goal of replacing areas of arbitrary shapes and sizes in images 
by some other content was first presented by Masnou and Morel in [30]. This method used 
level-lines to disocclude the region to inpaint. The term “inpainting” was first introduced by 
Bertalmio et al. in [7]. Subsequently, a vast amount of research was done in the area of image 
inpainting [6], and to a lesser extent in video inpainting. 

Generally speaking, video inpainting algorithms belong to either the “object-based” or 
“patch-based” category. Object-based algorithms usually segment the video into moving foreground
objects and background that is either still or displays simple motion. These segmented
image sequences are then inpainted using separate algorithms. The background is often inpainted
using image inpainting methods such as [13], whereas moving objects are often copied
into the occlusion as smoothly as possible. Unfortunately, such methods include restrictive
hypotheses on the moving objects’ motion, such as strict periodicity. Some object-based
methods include [2, 24, 28].

Patch-based methods are based on the intuitive idea of copying and pasting small video 
“patches” (rectangular cuboids of video information) into the occluded area. These patches 
are very useful as they provide a practical way of encoding local texture, structure and motion 
(in the video case). 

Patches were first introduced for texture synthesis in images [18], and subsequently used 
with great success in image inpainting [8, 13, 17]. These methods copy and paste patches into 
the occlusion in a greedy fashion, which means that no global coherence of the solution can be 
guaranteed. This general approach was extended by Patwardhan et al. to the spatio-temporal
case in [34]. In [35], this approach was further improved so that moving cameras could be
dealt with. These extensions remain greedy, and the resulting lack of global coherence can be
a drawback, especially for the correct inpainting of moving objects. This is reflected by the
fact that a good segmentation of the scene into moving foreground objects and background is
needed to produce good quality results.

Another method leading to a different family of algorithms was presented by Demanet 
et al. in [15]. The key insight here is that inpainting can be viewed as a labelling problem: 
each occluded pixel can be associated with an unoccluded pixel, and the final labels of the 
pixels result from a discrete optimisation process. This idea was subsequently followed by 
a series of image inpainting methods [22, 25, 29, 36] which use optimisation techniques such 
as graph cuts algorithms [9], to minimise a global patch-based functional. The vector field 
representing the correspondences between occluded and unoccluded pixels is referred to as 
the shift map by Pritch et al. [36], a term which we shall use in the current work. This idea 
was extended to the video case by Granados et al. in [20]. They propose a semi-automatic 
algorithm which optimises the spatio-temporal shift map. This algorithm presents impressive 
results on higher resolution images than were previously found in the literature (up to 1120 × 754
pixels). However, in order to reduce the large search space and high time complexity of the 
optimisation method, manual tracking of moving occluded objects is required. To the best of 
our knowledge, the inpainting results of Granados et al. are the most advanced to date, and 
we shall therefore compare our algorithm with these results. 

We also note the work of Herling and Broll [23] whose goal is “diminished reality”, which
considers the inpainting task coupled with a tracking problem. To our knowledge, this is the
only approach which inpaints videos in real time. However, the method relies
on restrictive hypotheses on the nature of the scene to inpaint and can therefore only deal 
with tasks such as removing a static object from a rigid platform. 

Another family of patch-based video inpainting methods was introduced in the seminal 
work of Wexler et al. [38]. This paper proposes an iterative method that may be seen as a
heuristic to solve a global optimisation problem. This work is widely cited and well-known
in the video inpainting domain, mainly because it ensures global coherency in an automatic 
manner. This method is in fact closely linked to methods such as non-local denoising [10]. 
This link was also noted in the work of Arias et al. [3], which introduced a general non-local
patch-based variational framework for inpainting in the image case. In fact, the algorithm of
Wexler et al. may be seen as a special case of this framework. We shall refer to this general
approach as the non-local patch-based approach. Darabi et al. have presented another variation 
on the work of Wexler et al. for image inpainting purposes in [14]. In the video inpainting case, 
the high dimensionality of the problem makes such approaches extremely slow, in particular 
due to the nearest neighbour search, requiring up to several days for a few seconds of VGA 
video. This problem was considered in our previous work [31], which represented a first step 
towards achieving high quality video inpainting results even on high resolution videos. Given 
the flexibility and potential of the non-local patch-based approach, we use it here to form 
the core of the proposed method. In order to obtain an algorithm which can deal with a 
wide variety of complex situations, we consider and propose solutions to some of the most 
important questions which arise in video inpainting. 

1.2. Outline and contributions of the paper. The ultimate goal of our work is to produce 
an automatic and generic video inpainting algorithm which can deal with complex and varied 
situations. In Section 2, we present the variational framework and the notations which we 
shall use in this paper. The proposed algorithm and contributions are presented in Section 3. 
These contributions can be summarised with the following points: 

• we greatly accelerate the nearest neighbour search, using an extension of the PatchMatch
algorithm [5] to the spatio-temporal case (Section 3.1);

• we introduce texture features to the patch distance in order to correctly inpaint video 
textures (Section 3.3); 

• we deal with the problem of moving background (Section 3.4) in videos, using a robust 
affine estimation of the dominant motion in the video; 

• we describe our initialisation scheme (Section 3.5), which is often left unspecified in 
previous work; 

• we give precise details concerning the implementation of the multi-resolution scheme 
(Section 3.6). 

As shown in Section 4, the proposed algorithm produces high quality results in an automatic
manner, in a wide range of complex video inpainting situations: moving cameras, multiple
moving objects, changing background and dynamic video textures. No pre-segmentation
of the video is required. One of the most significant advantages of the proposed method is 
that it deals with all these situations in a single framework, rather than having to resort to 
separate algorithms for each case, as in [20] and [19] for instance. In particular, the problem of 
reconstructing dynamic video textures has not been previously addressed in other inpainting 
algorithms. This is a worthwhile advantage since the problem of synthesising video textures, 
which is usually done with dedicated algorithms, can be achieved in this single, coherent 
framework. Finally, our algorithm does not need any manual input other than the inpainting
mask, and does not rely on foreground/background segmentation, unlike many other
approaches [12, 20, 24, 28, 35].

We provide an implementation of our algorithm, which is available at the following address: 
http://www.telecom-paristech.fr/~gousseau/video_inpainting. 

2. Variational framework and notation. As we have stated in the introduction, our video 
inpainting algorithm takes a non-local patch-based approach. At the heart of such algorithms
lies a global patch-based functional which is to be optimised. This optimisation is carried
out using an iterative algorithm inspired by the work of Wexler et al. [39]. The central
machinery of the algorithm is based on the alternation of two core steps: a search for the
nearest neighbours of patches which contain occluded pixels, and a reconstruction step based
on the aggregation of the information provided by the nearest neighbours.

Figure 1. Illustration of the notation used for the proposed video inpainting algorithm.

This iterative algorithm is embedded in a multi-resolution pyramid scheme, similarly to 
[17, 39]. The multi-resolution scheme is vital for the correct reconstruction of structures and 
moving objects in large occlusions. The precise details concerning the multi-resolution scheme 
can be found in Section 3.6. 

A summary of the different steps in our algorithm can be seen in Figure 2, and the whole 
algorithm may be seen in Alg. 4. 

Notation. Before describing our algorithm, let us first set down some notation. A
diagram which summarises this notation can be seen in Figure 1. Let $u : \Omega \to \mathbb{R}^3$ represent
the colour video content, defined over a spatio-temporal volume $\Omega$. In order to simplify the
notation, $u$ will correspond both to the information being reconstructed inside the occlusion,
and the unoccluded information which will be used for inpainting. We denote a spatio-temporal
position in the video as $p = (x, y, t) \in \Omega$ and by $u(p) \in \mathbb{R}^3$ the vector containing the
colour values of the video at this position.

Let $\mathcal{H}$ be the spatio-temporal occlusion (the “hole” to inpaint) and $\mathcal{D}$ the data set (the
unoccluded area). Note that $\mathcal{H}$ and $\mathcal{D}$ correspond to spatio-temporal positions rather than
actual video content and they form a partition of $\Omega$, that is $\Omega = \mathcal{H} \cup \mathcal{D}$ and $\mathcal{H} \cap \mathcal{D} = \emptyset$.

Let $\mathcal{N}_p$ be a spatio-temporal neighbourhood of $p$. This neighbourhood is defined as a
rectangular cuboid centred on $p$. The video patch centred at $p$ is defined as the vector
$W_p^u = (u(q_1) \cdots u(q_N))$ of size $3 \times N$, where the $N$ pixels in $\mathcal{N}_p$, $q_1 \cdots q_N$, are ordered
in a predefined way.

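Let us note $\tilde{\mathcal{D}} = \{p \in \mathcal{D} : \mathcal{N}_p \subset \mathcal{D}\}$ the set of unoccluded pixels whose neighbourhood is
also unoccluded (video patch $W_p^u$ is only composed of known colour values). We shall only use
patches stemming from $\tilde{\mathcal{D}}$ to inpaint the occlusion. Also, let $\tilde{\mathcal{H}} = \Omega \setminus \tilde{\mathcal{D}}$ represent a dilated
version of $\mathcal{H}$.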

Given a distance $d(\cdot, \cdot)$ between video patches, a key tool for patch-based inpainting is to
define a correspondence map that associates to each pixel $p \in \Omega$ (notably those in occlusion)
another position $q \in \tilde{\mathcal{D}}$, such that patches $W_p^u$ and $W_q^u$ are as similar as possible. This can be
formalised using the so-called shift map $\phi : \Omega \to \mathbb{R}^3$ that captures the shift between a position
and its correspondent, that is $q = p + \phi(p)$ is the “correspondent” of $p$. This map must verify
that $p + \phi(p) \in \tilde{\mathcal{D}}, \forall p$ (see Figure 1 for an illustration).

Minimising a non-local patch-based functional. The cost function which we use, following the
work of Wexler et al. [39], has both $u$ and $\phi$ as arguments:

(2.1)  $E(u, \phi) = \sum_{p \in \mathcal{H}} d^2(W_p^u, W_{p+\phi(p)}^u),$

with

(2.2)  $d^2(W_p^u, W_{p+\phi(p)}^u) = \frac{1}{N} \sum_{q \in \mathcal{N}_p} \|u(q) - u(q + \phi(p))\|_2^2.$

In all that follows, in order to avoid cumbersome notation we shall drop the $u$ from $W_p^u$
and simply denote patches as $W_p$.
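
As an illustration of the patch distance (2.2), the following minimal Python sketch evaluates it on a toy video stored as a dictionary from positions $(x, y, t)$ to colour triples. The dictionary representation and the names `patch_offsets` and `patch_distance_sq` are assumptions made for this example only; they are not taken from the released implementation.

```python
from itertools import product


def patch_offsets(half=(1, 1, 1)):
    """Offsets of a cuboid neighbourhood centred on a pixel, in a fixed order."""
    hx, hy, ht = half
    return list(product(range(-hx, hx + 1),
                        range(-hy, hy + 1),
                        range(-ht, ht + 1)))


def patch_distance_sq(u, p, shift, offsets):
    """Squared distance (2.2) between the patch at p and the patch at p + shift.

    u maps a position (x, y, t) to a colour triple. Every position touched
    by the two patches is assumed to lie inside the video.
    """
    total = 0.0
    for d in offsets:
        q = (p[0] + d[0], p[1] + d[1], p[2] + d[2])
        r = (q[0] + shift[0], q[1] + shift[1], q[2] + shift[2])
        total += sum((a - b) ** 2 for a, b in zip(u[q], u[r]))
    return total / len(offsets)
```

On a video whose first colour channel is simply the $x$ coordinate, a shift of one pixel along $x$ gives a distance of 1, whereas a shift along $y$ gives 0.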

We show in Appendix A that this functional is in fact a special case of the formulation
of Arias et al. [3]. As mentioned at the beginning of this Section, this functional is optimised
using the following two steps:

Matching. Given the current video $u$, find in $\tilde{\mathcal{D}}$ the nearest neighbour (NN) of each patch $W_p$
that has pixels in the inpainting domain $\mathcal{H}$, that is, the map $\phi(p), \forall p \in \Omega \setminus \tilde{\mathcal{D}}$.
Reconstruction. Given the shift map $\phi$, attribute a new value $u(p)$ to each pixel $p \in \mathcal{H}$.

These steps are iterated so as to converge to a satisfactory solution. The process may be 
seen as an alternated minimisation of cost (2.1) over the shift map and the video content u. 
As in many image processing and computer vision problems, this approach is implemented in 
a multi-resolution framework in order to improve results and avoid local minima. 
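
Writing $E(u, \phi)$ for the cost (2.1), one pass of this alternation at a given pyramid level can be sketched as follows. The iteration index $k$ is introduced only for this illustration, and the matching step is understood under the constraint $p + \phi(p) \in \tilde{\mathcal{D}}$. In practice the first line is only solved approximately, since the nearest neighbour search is approximate.

```latex
\begin{align*}
\phi^{k+1} &= \arg\min_{\phi} \; E(u^{k}, \phi) && \text{(matching)} \\
u^{k+1}    &= \arg\min_{u} \; E(u, \phi^{k+1}) && \text{(reconstruction)}
\end{align*}
```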

3. Proposed algorithm. Now that we have given a general outline of our algorithm, we 
proceed to address some of the key challenges in video inpainting. The first of these concerns 
the search for the nearest neighbours of patches centred on pixels which need to be inpainted. 

3.1. Approximate Nearest Neighbour (ANN) search. When considering the high complexity
of the NN search step, it quickly becomes apparent that searching for exact nearest
neighbours would take far too long. Therefore, an approximate nearest neighbour (ANN)
search is carried out. Wexler et al. proposed the k-d tree based approach of Arya and Mount
[4] for this step, but this approach remains quite slow. For example, one ANN search step
takes about an hour for a video containing 120 × 340 × 100 pixels, with about 422,000 missing
pixels, which represents a relatively small occlusion (the equivalent of a 65 × 65 pixel box in
each frame). We shall address this problem here, in particular by using an extension of the
PatchMatch algorithm [5] to the spatio-temporal case. We note that the PatchMatch algorithm
has also been used in conjunction with a 2D version of Wexler’s algorithm for image
inpainting, in the Content-Aware Fill tool of Photoshop [1], and by Darabi et al. [14].

Barnes et al.’s PatchMatch is a conceptually simple algorithm based on the hypothesis
that, in the case of image patches, the shift map defined by the spatial offsets between ANNs
is piece-wise constant. This is essentially because the image elements which the ANNs connect
are often on rigid objects of a certain size. In essence, the algorithm looks randomly for ANNs
and tries to “spread” those which are good. We extend this principle to the spatio-temporal
setting. Our spatio-temporal extension of the PatchMatch algorithm consists of three steps:
(i) initialisation, (ii) propagation and (iii) random search.

Figure 2. Diagram of the proposed video inpainting algorithm.

Let us recall that $\tilde{\mathcal{H}}$ is a dilated version of $\mathcal{H}$. Initialisation consists of randomly associating
an ANN to each patch $W_p$, $p \in \tilde{\mathcal{H}}$, which gives an initial ANN shift map, $\phi$. In fact, apart
from the first iteration, we already have a good initialisation: the shift map $\phi$ from the
previous iteration. Therefore, except during the initialisation step (see Section 3.5), we use
this previous shift map in our algorithm instead of initialising randomly.

The propagation step encourages shifts in $\phi$ which lead to good ANNs to be spread
throughout $\phi$. In this step, all positions in the video volume are scanned lexicographically.
For a given patch $W_p$ at location $p = (x, y, t)$, the algorithm considers the following three
candidates: $W_{p+\phi(x-1,y,t)}$, $W_{p+\phi(x,y-1,t)}$ and $W_{p+\phi(x,y,t-1)}$. If one of these three patches has
a smaller patch distance with respect to $W_p$ than $W_{p+\phi(p)}$, then $\phi(p)$ is replaced with the
new, better shift. The scanning order is reversed for the next iteration of the propagation,
and the algorithm tests $W_{p+\phi(x+1,y,t)}$, $W_{p+\phi(x,y+1,t)}$ and $W_{p+\phi(x,y,t+1)}$. In the two different
scanning orderings, the important point is obviously to use the patches which have already
been processed in the current propagation step.

The third step, the random search, consists in looking randomly for better ANNs of each
$W_p$ in an increasingly small area around $p + \phi(p)$, starting with a maximum search distance.
At iteration $k$, the random candidates are centred at the following positions:

(3.1)  $q = p + \phi(p) + \lfloor r_{\max} \rho^k \delta_k \rfloor,$

where $r_{\max}$ is the maximum search radius around $p + \phi(p)$, $\delta_k$ is a 3-dimensional vector drawn
from the uniform distribution over the cube $[-1, 1] \times [-1, 1] \times [-1, 1]$ and $\rho \in (0, 1)$ is the
reduction factor of the search window size. In the original PatchMatch, $\rho$ is set to 0.5. This
random search avoids the algorithm getting stuck in local minima. The maximum search
parameter $r_{\max}$ is set to the maximum dimension of the video, at the current resolution level.

The propagation and random search steps are iterated several times to converge to a good 
solution. In our work, we set this number of iterations to 10. For further details concerning the 
PatchMatch algorithm in the 2D case, see [5]. Our spatio-temporal extension is summarized 
in Algorithm 1. 
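
The propagation and random search steps described above can be sketched in a few lines of Python. This is a toy illustration under several assumptions, not the released implementation: the video is a dictionary from $(x, y, t)$ to a grey level, `targets` is the lexicographically sorted list of positions to match, `valid` is the set of admissible patch centres, and all function and parameter names are hypothetical. The random candidate follows Equation 3.1.

```python
import math
import random


def add(p, s):
    """Translate position p by shift s."""
    return (p[0] + s[0], p[1] + s[1], p[2] + s[2])


def dist(u, p, q, offsets):
    """Mean squared grey-level difference between the patches centred at p and q."""
    return sum((u[add(p, d)] - u[add(q, d)]) ** 2 for d in offsets) / len(offsets)


def patchmatch_3d(u, targets, valid, phi, offsets, r_max, rho=0.5, n_iter=10, rng=None):
    """Refine the shift map phi (dict: position -> shift) in place and return it."""
    rng = rng or random.Random(0)
    cost = {p: dist(u, p, add(p, phi[p]), offsets) for p in targets}

    def try_shift(p, s):
        # Accept shift s for p only if it points to an admissible, better patch.
        q = add(p, s)
        if q in valid:
            c = dist(u, p, q, offsets)
            if c < cost[p]:
                phi[p], cost[p] = s, c

    for k in range(1, n_iter + 1):
        forward = (k % 2 == 0)          # even: forward scan, odd: reverse scan
        step = -1 if forward else 1     # look at neighbours already processed
        for p in (targets if forward else reversed(targets)):
            for axis in range(3):       # propagation along x, y and t
                nb = tuple(c + step * (i == axis) for i, c in enumerate(p))
                if nb in phi:
                    try_shift(p, phi[nb])
        radius = r_max * rho ** k       # shrinking search window
        for p in targets:               # random search around p + phi(p)
            jump = tuple(math.floor(radius * rng.uniform(-1.0, 1.0)) for _ in range(3))
            try_shift(p, add(phi[p], jump))
    return phi
```

Since a shift is only replaced when the candidate is strictly better, the patch distance of every target can only decrease from one iteration to the next.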

We note here that other ANN search methods for image patches exist which outperform 
PatchMatch [21, 26, 33]. However, in practice, PatchMatch appeared to be a good option 
because of its conceptual simplicity and nonetheless very good performance. Furthermore, to 
take the example of the “TreeCANN” method of Olonetsky and Avidan [33], the reported reduction
in execution time is largely based on a very good ANN shift map initialisation followed
by a small number of propagation steps. In our case, we already have a good initialisation 
(from the previous iteration), which makes the usefulness of such approaches questionable. 
However, further acceleration is certainly something which could be developed in the future. 

Algorithm 1: ANN search with 3D PatchMatch

Data: Current inpainting configuration $u$, $\phi$, $\tilde{\mathcal{H}}$
Result: ANN shift map $\phi$

for $k = 1$ to $10$ do
    if $k$ is even then    /* Propagation on even iteration */
        for $p = p_1$ to $p_{|\tilde{\mathcal{H}}|}$ (pixels in $\tilde{\mathcal{H}}$ lexicographically ordered) do
            $a = p - (1,0,0)$, $b = p - (0,1,0)$, $c = p - (0,0,1)$;
            $q = \arg\min_{r \in \{p,a,b,c\}} d(W_p^u, W_{p+\phi(r)}^u)$;
            if $p + \phi(q) \in \tilde{\mathcal{D}}$ then $\phi(p) \leftarrow \phi(q)$
        end
    else    /* Propagation on odd iteration */
        for $p = p_{|\tilde{\mathcal{H}}|}$ to $p_1$ do
            $a = p + (1,0,0)$, $b = p + (0,1,0)$, $c = p + (0,0,1)$;
            $q = \arg\min_{r \in \{p,a,b,c\}} d(W_p^u, W_{p+\phi(r)}^u)$;
            if $p + \phi(q) \in \tilde{\mathcal{D}}$ then $\phi(p) \leftarrow \phi(q)$
        end
    end
    for $p = p_1$ to $p_{|\tilde{\mathcal{H}}|}$ do    /* Random search */
        $q = p + \phi(p) + \lfloor r_{\max} \rho^k \, \mathrm{RandUniform}([-1,1]^3) \rfloor$;
        if $d(W_p^u, W_q^u) < d(W_p^u, W_{p+\phi(p)}^u)$ and $q \in \tilde{\mathcal{D}}$ then $\phi(p) \leftarrow q - p$
    end
end

3.2. Video reconstruction. Concerning the reconstruction step, we use a weighted mean
based approach, inspired by the work of Wexler et al., in which each pixel is reconstructed in


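the following manner:

(3.2)  $u(p) = \frac{\sum_{q \in \mathcal{N}_p} s_q^p \, u(p + \phi(q))}{\sum_{q \in \mathcal{N}_p} s_q^p},$

with

(3.3)  $s_q^p = \exp\left(-\frac{d^2(W_q, W_{q+\phi(q)})}{2\sigma_p^2}\right).$
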

Wexler et al. proposed the use of an additional weighting term to give more weight to
the information near the occlusion border. We dispense with this term, since in our scheme
it is somewhat replaced by our method of initialising the solution which will be detailed in
Section 3.5. Parameter $\sigma_p$ is defined as the 75th percentile of all distances
$\{d(W_q, W_{q+\phi(q)}), q \in \mathcal{N}_p\}$ as in [39].

Observe that in order to minimise (2.1), the natural approach would be to do the reconstruction
with the non-weighted scheme ($s_q^p = 1$ in Equation 3.2) that stems from $\frac{\partial E}{\partial u} = 0$.
However, the weighted scheme above tends to accelerate the convergence of the algorithm, 
meaning that we produce good results faster. 
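
To make the weighted scheme concrete, the sketch below reconstructs a single pixel from the colours proposed by its neighbours. It is a minimal illustration under stated assumptions: the weights take a Gaussian form with a $2\sigma_p^2$ denominator, the 75th percentile uses the nearest-rank convention, and the names `percentile` and `reconstruct_pixel` are hypothetical. When $\sigma_p$ is zero, only exactly matching proposals are kept, which is the limit of the weights as $\sigma_p$ tends to zero.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (one convention among several)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def reconstruct_pixel(colours, sq_dists, pct=75.0):
    """Weighted mean of the colours proposed for one pixel p.

    colours[i]  : colour u(p + phi(q_i)) proposed by the i-th neighbour q_i
    sq_dists[i] : squared patch distance d^2(W_{q_i}, W_{q_i + phi(q_i)})
    """
    sigma = percentile([math.sqrt(d) for d in sq_dists], pct)
    if sigma == 0.0:
        # Limit case: keep only the proposals whose patches match exactly.
        weights = [1.0 if d == 0.0 else 0.0 for d in sq_dists]
    else:
        weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    n_channels = len(colours[0])
    return tuple(sum(w * c[i] for w, c in zip(weights, colours)) / total
                 for i in range(n_channels))
```

With two proposals of equal patch distance the result is their plain average; when one proposal has a much larger distance, the result moves towards the better matching one.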
Figure 3. Comparison of different final reconstruction methods. We observe that the proposed
reconstruction using only the best patch at the end of the algorithm produces similar results to the use of 
the mean shift algorithm, avoiding blur induced by weighted patch averaging, while being less computationally 
expensive. Please note that the blurring effect is best viewed in the pdf version of the paper. 


An important observation is that, in the case of regions with high frequency details, the 
use of this mean reconstruction (weighted or unweighted) often leads to blurry results, even if 
the correct patches have been identified. This phenomenon was also noted in [39]. Although 
we shall propose in Section 3.3 a method to correctly identify textured patches in the matching 
steps, this does not deal with the reconstruction of the video. We thus need to address this 
problem, at least in the final stage of the approach: throughout the algorithm, we use the 
weighted mean given in Equation 3.2 and, at the end of the algorithm, when the solution
has converged at the finest pyramid level, we simply inpaint the occlusion using the best 
patch among the contributing ones. This corresponds to setting $\sigma_p$ to 0 in (3.2-3.3) or, seen
in another light, may be viewed as a very crude annealing procedure. Final reconstruction at
position $p \in \mathcal{H}$ reads:

(3.4)  $u^{(\mathrm{final})}(p) = u(p + \phi^{(\mathrm{final})}(q^*))$, with $q^* = \arg\min_{q \in \mathcal{N}_p} d(W_q^{(\mathrm{final}-1)}, W_{q+\phi^{(\mathrm{final})}(q)}).$

Another solution to this problem based on the mean shift algorithm was proposed by Wexler
et al. in [39], but such an approach increases the complexity and execution time of the
algorithm. Figure 3 shows that very similar results to those in [39] may be obtained with our 
much simpler approach. 
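
For a single pixel, the final best-patch reconstruction reduces to selecting the proposal with the smallest patch distance. The short Python sketch below illustrates this; the function name `reconstruct_final` and the list-based inputs are assumptions made for the example.

```python
def reconstruct_final(colours, dists):
    """Return the colour proposed by the neighbour with the smallest patch distance.

    colours[i] : colour u(p + phi(q_i)) proposed by the i-th neighbour q_i
    dists[i]   : patch distance d(W_{q_i}, W_{q_i + phi(q_i)})
    """
    best = min(range(len(dists)), key=dists.__getitem__)
    return colours[best]
```

No averaging takes place, so the reconstructed value is always a colour that exists in the unoccluded data.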

3.3. Video texture pyramid. In order for any patch-based inpainting algorithm to work,
it is necessary that the patch distance identify “correct” patches. This is not the case in
several situations. Firstly, as noticed by Liu and Caselles in [29], the use of multi-resolution
pyramids can make patch comparisons ambiguous, especially in the case of textures, in images
and videos. Secondly, it turns out that the commonly used $\ell^2$ patch distance is ill-adapted to
comparing textured patches. Thirdly, PatchMatch itself can contribute to the identification of
incorrect patches. These reasons are explored and explained more extensively in Appendix B.
A visual illustration of the problem may be seen in Figure 4. We note here that similar
ambiguities were also identified by Bugeau et al. in [11], but their interpretation of the
phenomenon was somewhat different.

Figure 4. Illustration of the necessity of texture features for inpainting. Without the texture
features, the correct texture may not be found.

We propose a solution to this problem in the form of a multi-resolution pyramid which 
reflects the textural nature of the video. We shall refer to this as the texture feature pyramid. 
The information in this texture feature pyramid is added to the patch distance in order to 
identify the correct patches. 

In order to identify textures, we shall consider a simple, gradient-based texture attribute.
Following Liu and Caselles [29], we consider the absolute value of the image derivatives,
averaged over a certain spatial neighbourhood $\nu$. Obviously, many other attributes could be
considered; however, they seemed too involved for our purposes.

More formally, we introduce the two-dimensional texture feature $T = (T_x, T_y)$, computed
at each pixel $p \in \Omega$:

(3.5)  $T(p) = \frac{1}{\mathrm{card}(\nu)} \sum_{q \in \nu} \left( |I_x(q)|, |I_y(q)| \right),$

where $I_x(q)$ (resp. $I_y(q)$) is the derivative of the image intensity (grey-level) in the $x$ (resp.
$y$) direction at the pixel $q$. The squared patch distance is now defined as

(3.6)  $d^2(W_p, W_q) = \frac{1}{N} \sum_{r \in \mathcal{N}_p} \left( \|u(r) - u(r - p + q)\|_2^2 + \lambda \|T(r) - T(r - p + q)\|_2^2 \right),$

where $\lambda$ is a weighting scalar.
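
The texture features and the augmented distance can be illustrated with the following Python sketch for a single grey-level frame stored as a list of rows. Several choices here are assumptions of the example rather than statements about our implementation: derivatives are forward differences, the averaging window is clipped at the frame border, and the names `texture_features` and `pixel_cost` are hypothetical.

```python
def texture_features(frame, radius):
    """Per-pixel (T_x, T_y): mean absolute derivatives over a square window.

    frame is a list of rows of grey levels. The window has side 2 * radius + 1
    and is clipped to the part that falls inside the frame.
    """
    h, w = len(frame), len(frame[0])
    gx = [[abs(frame[y][min(x + 1, w - 1)] - frame[y][x]) for x in range(w)]
          for y in range(h)]
    gy = [[abs(frame[min(y + 1, h - 1)][x] - frame[y][x]) for x in range(w)]
          for y in range(h)]
    feats = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            n = len(ys) * len(xs)
            tx = sum(gx[j][i] for j in ys for i in xs) / n
            ty = sum(gy[j][i] for j in ys for i in xs) / n
            feats[y][x] = (tx, ty)
    return feats


def pixel_cost(c1, c2, t1, t2, lam):
    """One term of the sum in the augmented distance: colour plus lam times texture."""
    colour = sum((a - b) ** 2 for a, b in zip(c1, c2))
    texture = sum((a - b) ** 2 for a, b in zip(t1, t2))
    return colour + lam * texture
```

On a frame of vertical stripes the feature is large in $x$ and zero in $y$, so two pixels of identical colour but different texture are separated by the second term of the cost.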

The feature pyramid is then set up by subsampling the texture features of the full-resolution
input video $u$. We note here that each level is obtained by subsampling the information
contained at the finest pyramid resolution, and not by calculating $T^\ell$ based on the
subsampled video at the level $\ell$:

(3.7)  $\forall (x, y, t) \in \Omega^\ell, \quad T^\ell(x, y, t) = T(2^\ell x, 2^\ell y, t), \quad \ell = 1 \cdots L.$

This is an important point, since the required textural information does not exist at coarser
levels. Features are not filtered before subsampling, since they have already been averaged
over the neighbourhood $\nu$. In the experiments done in this paper, this neighbourhood is set,
by default, to the area to which a coarsest-level pixel corresponds, which is a square of size
$2^{L-1}$, as is done in [29]. However, in a more general setting, the size of this area should be
independent of the number of levels, so care should be taken in the case where few pyramid 
levels are used. 
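
The subsampling rule of Equation 3.7 can be sketched as follows, assuming the full-resolution features are stored as nested lists indexed `T[t][y][x]`. The function name `texture_pyramid` and this storage layout are assumptions of the example. Every level is sampled directly from the full-resolution features, and the temporal axis is left untouched.

```python
def texture_pyramid(T, n_levels):
    """Return [T^0, T^1, ..., T^n_levels], with T^l(x, y, t) = T(2^l x, 2^l y, t).

    T is indexed as T[t][y][x]. Each level is read from the full-resolution
    features, never recomputed from a subsampled video.
    """
    levels = [T]
    for level in range(1, n_levels + 1):
        step = 2 ** level
        levels.append([[row[::step] for row in frame[::step]] for frame in T])
    return levels
```

For example, the entry at $(x, y, t) = (3, 2, 1)$ of level 1 is the full-resolution entry at $(6, 4, 1)$.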

A notable difference with respect to the work of Liu and Caselles [29] is the fact that we 
use the texture features at all pyramid levels. Liu and Caselles do not do this, since they 
perform graph cut based optimisation at the coarsest level, and at the finer levels only consider 
small relative shifts with respect to the coarse solution. 

A final choice which must be made when using the texture features is how they are
themselves reconstructed. In shift map based algorithms, this is not a problem, since by
definition an occluded pixel takes on the characteristics of its correspondent in $\mathcal{D}$ (colour,
texture features or anything else).

In our case, we inpaint the texture features using the same reconstruction scheme as is 
used for colour information (see Eq. 3.2): 


(3.8)  $T(p) = \frac{\sum_{q \in \mathcal{N}_p} s_q^p \, T(p + \phi(q))}{\sum_{q \in \mathcal{N}_p} s_q^p}, \quad \forall p \in \mathcal{H}.$



Conceptually, the use of these features is quite simple, and easily fits into our inpainting 
framework. To summarise, these features may be seen as simple texture descriptors which 
help the algorithm avoid making mistakes when choosing the area to use for inpainting. 

The methodology which we have proposed for dealing with dynamic video textures is 
important for the following reasons. Firstly, to the best of our knowledge, this is the first 
inpainting approach which proposes a global optimisation and which can deal correctly with 
textures in images and videos, without restricting the search space (contrary to [13, 14, 20, 
29, 36]). Secondly, while the problem of recreating video textures is a research subject in its 
own right and algorithms have been developed for their synthesis [16, 27, 37], ours is the first 
algorithm to achieve this within an inpainting framework. Finally, we note that algorithms such
as that of Granados et al. [19], which are specifically dedicated to background reconstruction,
cannot deal with background video textures (such as waves), since they suppose that the
background is rigid. This hypothesis is clearly not true for video textures. An example of the
impact of the texture features in video inpainting may be seen in Figure 7.

[Figure 5 panels: original frame, “Jumping man” [19]; inpainting without realignment; inpainting with realignment.]

Figure 5. A comparison of inpainting results with and without affine realignment. Notice the
incoherent reconstruction on the steps, due to random camera motion which makes spatio-temporal patches
difficult to compare. These random motions are corrected with affine motion estimation and inpainting is
performed in the realigned domain.

3.4. Inpainting with mobile background. We now turn to another common case of video 
inpainting, that of mobile backgrounds. This is the case, for example, when hand-held cameras 
are used to capture the input video. 

There are several possible solutions to this problem. Patwardhan et al. [35] segment the 
video into moving foreground and background (which may also display motion) using motion 
estimation with block matching. Once the moving foreground is inpainted, the background 
is realigned with respect to a motion estimated with block matching, and the background 
is filled by copying and pasting background pixels. In this case, the background should be 
perfectly realigned. 

Granados et al. [19] propose a homography-based algorithm for this task. They estimate
a set of homographies between each frame and choose which homography should be used for 
each occluded pixel belonging to the background. 

Both of these algorithms require that the background and foreground be segmented, which 
we wish to avoid here. Furthermore, they have the quite strict limitation that pixels are 
simply copied from their realigned positions, meaning that the realignment must be extremely 
accurate. Here, we propose a solution which allows us to use our patch-based variational 
framework for both tasks (foreground and background inpainting) simultaneously, without 
any segmentation of the video into foreground and background. 

The fundamental hypothesis behind patch-based methods in images or videos is that 
content is redundant and repetitive. This is easy to see in images, and may appear to be the 
case in videos. However, the temporal dimension is added in video patches, meaning that 
a sequence of image patches should be repeated throughout the video. This is not the case 
when a video displays random motion (as with a mobile camera): even if the required content
appears at some point in the sequence, there is no guarantee that the required spatio-temporal
patches will repeat themselves with the same motion. Empirically, we have observed that this
is a significant problem even in the case of motions with small amplitude.

Algorithm 2: Pre-processing step for realigning the input video.

Data: Input video $u$
Result: Aligned video and affine warps

$N_f$ = number of frames in video;
$N_m = \lfloor N_f / 2 \rfloor$;
for $n = 1$ to $N_f - 1$ do
    $\theta_{n,n+1} \leftarrow$ EstimateAffineMotion($u_n$, $u_{n+1}$);
end
for $n = 1$ to $N_f$ do
    if $n < N_m$ then $\theta_{n,N_m} = \theta_{N_m-1,N_m} \circ \cdots \circ \theta_{n,n+1}$;
    if $n > N_m$ then $\theta_{n,N_m} = \theta_{N_m,N_m+1}^{-1} \circ \cdots \circ \theta_{n-1,n}^{-1}$;
    $u_n \leftarrow$ AffineWarp($u_n$, $\theta_{n,N_m}$);
end

To counter this problem, we estimate a dominant, affine motion between each pair of 
successive frames, and use this to realign each frame with respect to one reference frame. In 
our work, we chose the reference frame to be the middle frame of the sequence (this should 
be adapted for larger sequences). We use the work of Odobez and Bouthemy [32] to realign 
the frames. The colour values of the pixels in the realigned frames are obtained using linear 
interpolation. The occlusion $\mathcal{H}$ is also realigned. Once the frames are realigned 
with the reference frame (Alg. 2), we inpaint the video as usual. Finally, when inpainting is 
finished, we perform the inverse affine transformation on the images and paste the solution 
into the original occluded area. Figure 5 compares the results with and without this 
pre-processing step, on a prototypical example. Without it, it is not possible to find coherent 
patches which respect the border conditions. 
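
The chaining of pairwise motions into a single warp towards the reference frame can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: each pairwise motion is assumed to be already available as a 3 x 3 homogeneous matrix mapping coordinates of frame n to frame n+1, frames are indexed from 0, and the names `matmul3`, `invert_affine` and `compose_to_reference` are hypothetical. The motion estimation itself [32] and the interpolation of colour values are not shown.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def invert_affine(m):
    """Invert a homogeneous affine matrix [[a, b, tx], [c, d, ty], [0, 0, 1]]."""
    a, b, tx = m[0]
    c, d, ty = m[1]
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [[ia, ib, -(ia * tx + ib * ty)],
            [ic, id_, -(ic * tx + id_ * ty)],
            [0.0, 0.0, 1.0]]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def compose_to_reference(pairwise, ref):
    """pairwise[n] maps frame n to frame n + 1 (assumed convention).
    Returns, for every frame, the matrix mapping it to frame `ref`."""
    n_frames = len(pairwise) + 1
    to_ref = [None] * n_frames
    to_ref[ref] = IDENTITY
    # Frames before the reference: chain forward motions n -> n+1 -> ... -> ref.
    for n in range(ref - 1, -1, -1):
        to_ref[n] = matmul3(to_ref[n + 1], pairwise[n])
    # Frames after the reference: chain inverse motions n -> n-1 -> ... -> ref.
    for n in range(ref + 1, n_frames):
        to_ref[n] = matmul3(to_ref[n - 1], invert_affine(pairwise[n - 1]))
    return to_ref


# Example: a camera translating by one pixel per frame along x, five frames.
shift = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
warps = compose_to_reference([shift] * 4, ref=2)
```

In this toy example the first frame is mapped to the reference by a translation of two pixels, and the last frame by a translation of two pixels in the opposite direction.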

3.5. Initialisation of the solution. The iterative procedure at the heart of our algorithm 
relies on an initial inpainting solution. The initialisation step is very often left unspecified in 
work on video inpainting. As we shall see in this Section, it plays a vital role, and we therefore 
explain our chosen initialisation method in detail. 

We inpaint at the coarsest level using an “onion peel” approach, that is to say we inpaint 
one layer of the occlusion at a time, each layer being one pixel thick. 

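More formally, let $\mathcal{H}' \subset \mathcal{H}$ be the current occlusion, and $\partial\mathcal{H}' \subset \mathcal{H}'$ the current layer to 
inpaint. We define the unoccluded neighbourhood $\mathcal{N}'_p$ of a pixel $p$, with respect to the current 
occlusion $\mathcal{H}'$, as: 

(3.9) $\mathcal{N}'_p = \{ q \in \mathcal{N}_p,\ q \notin \mathcal{H}' \}$. 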

Some choices are needed to implement this initialisation method. First of all, we only 
compare the unoccluded pixels during a patch comparison. The distance between two patches 
$W_p$ and $W_{p+\phi(p)}$ is therefore redefined as: 

(3.10) $d^2(W_p, W_{p+\phi(p)}) = \frac{1}{|\mathcal{N}'_p|} \sum_{q \in \mathcal{N}'_p} \left( \|u(q) - u(q + \phi(p))\|_2^2 + \lambda \|T(q) - T(q + \phi(p))\|_2^2 \right)$. 

[Figure 6 panels. Initialisations: (a) occluded input image, (b) random initialisation, (c) smooth 
zero-Laplacian interpolation, (d) onion peel initialisation. Results: (a) original image, 
(b) inpainting after random initialisation, (c) inpainting after zero-Laplacian interpolation, 
(d) inpainting after onion peel initialisation.] 

Figure 6. Impact of initialisation schemes on inpainting results. Note that the only scheme 
which successfully reconstructs the cardboard tube is the “onion peel” approach, where the first 
filling at the coarsest resolution is conducted in a greedy fashion. 

We also need to choose which neighbouring patches to use for reconstruction. Some will 
be quite unreliable, as only a small part of the patches are compared. In our implementation, 
we only use the ANNs of patches whose centres are located outside the current occlusion 
layer. Formally, we reconstruct the pixels in the current layer by using the following formula, 
modified from Equation 3.2: 


(3.11) $u(p) = \dfrac{\sum_{q \in \mathcal{N}'_p} s^p_q\, u(p + \phi(q))}{\sum_{q \in \mathcal{N}'_p} s^p_q}$. 


The same reconstruction is applied to the texture features. A pseudo-code for the initialisation 
procedure may be seen in Algorithm 3. 
Algorithm 3: Inpainting initialisation 

Data: Coarse level inputs $u^L$, $T^L$, $\phi^L$, $\mathcal{H}^L$ 
Result: Coarse initial filling $u^L$, $T^L$ and map $\phi^L$ 

$B \leftarrow$ 3 x 3 x 3 structuring element; 
$\mathcal{H}' \leftarrow \mathcal{H}^L$; 
while $\mathcal{H}' \neq \emptyset$ do 
    $\partial\mathcal{H}' \leftarrow \mathcal{H}' \setminus$ Erosion($\mathcal{H}'$, $B$); 
    $\phi^L \leftarrow$ ANNsearch($u^L$, $T^L$, $\phi^L$, $\partial\mathcal{H}'$);    // Alg. 1, with partial distance (3.10) 
    $u^L \leftarrow$ Reconstruction($u^L$, $\phi^L$, $\partial\mathcal{H}'$);       // Eq. 3.11 
    $T^L \leftarrow$ Reconstruction($T^L$, $\phi^L$, $\partial\mathcal{H}'$);       // Eq. 3.11 
    $\mathcal{H}' \leftarrow$ Erosion($\mathcal{H}'$, $B$); 
end 
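
The layer-by-layer peeling of the occlusion can be sketched as follows. This is only an illustration of the erosion loop, under the assumption that the occlusion is stored as a set of (x, y, t) voxel coordinates; the ANN search and reconstruction steps are omitted, and the function names are hypothetical. Voxels outside the set (including those outside the video support) are simply treated as unoccluded.

```python
from itertools import product

OFFSETS = list(product((-1, 0, 1), repeat=3))  # 3 x 3 x 3 structuring element


def erode(mask):
    """Keep the voxels whose whole 3x3x3 neighbourhood lies inside `mask`."""
    return {p for p in mask
            if all((p[0] + dx, p[1] + dy, p[2] + dt) in mask
                   for dx, dy, dt in OFFSETS)}


def onion_layers(occlusion):
    """Yield successive one-voxel-thick layers, from the border inwards."""
    current = set(occlusion)
    while current:
        inner = erode(current)
        yield current - inner   # layer that would be inpainted at this step
        current = inner


# Example: a 5 x 5 x 5 cubic occlusion is peeled in three layers.
cube = set(product(range(5), repeat=3))
sizes = [len(layer) for layer in onion_layers(cube)]
```

For the cubic example, the outer shell, the next shell and the central voxel are obtained in turn.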


Figure 6 shows some evidence to support a careful initialisation scheme. Three different 
initialisations have been tested: random initialisation, zero-Laplacian interpolation and onion 
peel. Random initialisation is achieved by initialising the occlusion with pixels chosen randomly 
from the image. Zero-Laplacian (harmonic) interpolation is the solution of the Laplace 
equation $\Delta u = 0$ with Dirichlet boundary conditions stemming from $\mathcal{D}$. It may be seen 
(Figure 6) that the first two initialisations are unable to join the two parts of the cardboard tube 
together, and that the subsequent iterations do not improve the situation. In contrast, the 
proposed initialisation produces a satisfactory result. 

In order to make our method as easy as possible to reimplement, we now present some 
further algorithmic details, which are in fact very important for achieving good results. 

3.6. Implementation of the multi-resolution scheme. Our first remarks concern the 
implementation of the multi-resolution pyramids. Wexler et al. and Granados et al. both 
note that temporal subsampling can be detrimental to inpainting results. This is due to the 
difficulty of representing motion at coarser levels. For this reason we do not subsample in 
the temporal direction, as in [20]. The only case where we need to temporally subsample is 
when the objects spend a long time behind the occlusion (this was only done in the “Jumping 
girl” sequence). This is quite a hard problem to solve since it becomes increasingly difficult 
to decide what motion an occluded object should have when the occlusion time grows longer, 
unless there is strictly periodic motion. We leave this as an open question, which could be 
investigated in further work. 

A crucial choice when using multi-resolution schemes is the number of pyramid levels to 
use. Most other methods leave this parameter unspecified, or fix the size of the image/video at 
the coarsest scale, and determine the resulting number of levels [20, 29, 36]. In fact, when one 
considers the problem in more detail, it becomes apparent that the number of levels should be 
set so that the occlusion size is not too large in comparison to the patch size. This intuition 
is supported by experiments in very simple image inpainting situations, which showed that 
the occlusion size should be somewhat less than twice the patch size. In our experiments, we 
follow this general rule of thumb. 
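
One way to turn this rule of thumb into a number of levels is sketched below. The text does not define how the occlusion size is measured, so the sketch assumes it is a single spatial extent in pixels (for instance the largest width of the occlusion) and that each level divides this extent by the subsampling factor of 2; the function name and the `max_levels` safeguard are hypothetical.

```python
def num_pyramid_levels(occlusion_size, patch_size=5, factor=2, max_levels=10):
    """Smallest number of levels such that, at the coarsest level, the
    occlusion size is below twice the patch size."""
    levels = 1
    size = float(occlusion_size)
    while size >= 2 * patch_size and levels < max_levels:
        size /= factor      # one more spatial subsampling step
        levels += 1
    return levels


# An 80-pixel-wide occlusion shrinks as 80 -> 40 -> 20 -> 10 -> 5.
levels = num_pyramid_levels(80)
```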
[Figure 7 panels: original frame (“Waves”), inpainting without features, inpainting with features.] 

Figure 7. Usefulness of the proposed texture features. Without the features, the algorithm fails to 
recreate the waves correctly; they are a typical example of a complex video texture. 


Another question which is of interest is how to pass from one pyramid level to another. 
Wexler et al. presented quite an intricate scheme in [39] to do this, whereas Granados et al. 
propose a simple upsampling of the shift map. This is conceptually simpler than the approach 
of Wexler et al. and, after experimentation, we chose this option as well. Therefore, the shift 
map $\phi$ is upsampled using nearest neighbour interpolation, and both the higher resolution 
video and the higher resolution texture features are reconstructed using Equation 3.2. One 
final note on this point is that we use the upsampled version of $\phi$ as an initialisation for the 
PatchMatch algorithm at each level apart from the coarsest (this differs from our previous work 
in [31]). 
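
Nearest neighbour upsampling of a shift map can be sketched as follows, on a 2D map for brevity. The rescaling of the shift vectors by the upsampling factor is an assumption of this sketch (it is the usual convention when shifts are expressed in pixels, but it is not spelled out in the text), and `upsample_shift_map` is a hypothetical name. Since no temporal subsampling is used, a temporal component would be left unchanged.

```python
def upsample_shift_map(phi, factor=2):
    """Nearest-neighbour upsampling of a 2D shift map.
    phi[y][x] is a (dx, dy) shift expressed in coarse-level pixels."""
    fine = []
    for y in range(len(phi) * factor):
        row = []
        for x in range(len(phi[0]) * factor):
            dx, dy = phi[y // factor][x // factor]   # nearest coarse pixel
            # Assumption: spatial shifts are rescaled to the finer grid.
            row.append((dx * factor, dy * factor))
        fine.append(row)
    return fine


coarse = [[(1, 0), (-2, 3)]]
fine = upsample_shift_map(coarse)
```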

3.7. Various implementation details. We also require a threshold which will stop the 
iterations of the ANN search and reconstruction steps. In our work, we use the average colour 
difference in each channel per pixel between iterations as a stopping criterion. If this falls 
below a certain threshold, we stop the iteration at the current level. We set this threshold 
to 0.1. In order to avoid iterating for too long, we also impose a maximum of 20 
iterations at any given pyramid level. 
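
The stopping rule can be sketched as the following loop. The exact form of the colour difference is an assumption here (a mean absolute difference per channel and per pixel is used), and `update` is a hypothetical stand-in for one ANN search followed by one reconstruction step.

```python
def mean_abs_change(prev, curr):
    """Average absolute colour difference per channel and per pixel."""
    total = sum(abs(a - b) for p, c in zip(prev, curr) for a, b in zip(p, c))
    return total / (len(prev) * len(prev[0]))


def iterate_level(pixels, update, threshold=0.1, max_iter=20):
    """Repeat `update` until the change falls below `threshold`,
    or until `max_iter` iterations have been performed."""
    k = 0
    for k in range(1, max_iter + 1):
        new_pixels = update(pixels)
        change = mean_abs_change(pixels, new_pixels)
        pixels = new_pixels
        if change < threshold:
            break
    return pixels, k


def halfway(pixels):
    """Toy update: every channel moves halfway towards the value 100."""
    return [tuple(v + 0.5 * (100.0 - v) for v in p) for p in pixels]


start = [(0.0, 0.0, 0.0), (50.0, 50.0, 50.0)]
result, n_iter = iterate_level(start, halfway)
```

With this toy update the change is halved at every iteration, so the loop stops well before the limit of 20 iterations.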

The patch size parameters were set to 5 x 5 x 5 in all of our experiments. We set the 
texture feature parameter $\lambda$ to 50. Concerning the spatio-temporal PatchMatch, we use ten 
iterations of propagation/random search during the ANN search algorithm and set the window 
size reduction factor $\beta$ to 0.5 (as in the paper of Barnes et al. [5]). 
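
To illustrate the role of the window size reduction factor, the following sketch lists the radii of the successive random search windows, following the usual description of PatchMatch [5]: the window starts at a maximum radius and is shrunk by the reduction factor until it is smaller than one pixel. The choice of the maximum radius (for example the largest dimension of the video) is an assumption, and `search_radii` is a hypothetical name.

```python
def search_radii(max_radius, beta=0.5):
    """Radii of the successive random-search windows: w, w*beta, w*beta^2, ..."""
    radii = []
    r = float(max_radius)
    while r >= 1.0:          # stop once the window is smaller than one pixel
        radii.append(r)
        r *= beta
    return radii


radii = search_radii(64)
```

With a reduction factor of 0.5, the number of candidates tested per patch grows only logarithmically with the search radius.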

The complete algorithm is summarized in Alg. 4. 

4. Experimental results. The goal of our work is to achieve high quality inpainting results 
in varied, complex video inpainting situations, with reduced execution time. Therefore, we 
shall evaluate our results in terms of visual quality and execution time. 

We compare our work to that of Wexler et al. [39] and to the most recent video inpainting 
method of Granados et al. [20]. All of the videos in this paper (and more) can be viewed and 
downloaded along with occlusion masks at http://www.telecom-paristech.fr/~gousseau/video_inpainting. 
An implementation of our method is also available at this address. 

4.1. Visual evaluations. First of all, we have tested our algorithm on the videos proposed 
by Wexler et al. [39] and Granados et al. [20]. The visual results of our algorithm may be 
seen in Figures 9 and 10. We note that the inpainting results of the previous authors on these 
examples are visually almost perfect, so very little qualitative improvement can be made. 
It may be seen that our results are of similarly high quality to those of the previous 
algorithms. In particular we are able to deal with situations where several moving objects 
must be correctly recreated, without requiring manual segmentation as in [20]. We also achieve 
these results in at least an order of magnitude less time than the previous algorithms. We 
note that it is not feasible to apply the method of Wexler et al. to the examples of [20], whose 
resolution is too large (up to 1120 x 754 x 200 pixels). 

Figure 8. A comparison of our inpainting result with that of the background inpainting 
algorithm of Granados et al. [19]. In such cases with moving background, we are able to achieve high 
quality results (as do Granados et al.), but we do this in one, unified algorithm. This illustrates the capacity 
of our algorithm to perform well in a wide range of inpainting situations. 

[Figure 9 panels. Sequences: “Crossing Ladies”, “Jumping girl”. For each sequence: original frames, 
inpainting result from [39], our inpainting result.] 

Figure 9. Comparison with Wexler et al. We achieve results of similar visual quality compared to 
those in [39], with a reduction of the ANN search time by a factor of up to 50 times. 

[Figure 10 panels: inpainting result from [20], our inpainting result.] 

Figure 10. Comparison with Granados et al. We achieve similar results to those of [20] in an order 
of magnitude less time, without user intervention. The occlusion masks are highlighted in green. 

Algorithm 4: Proposed video inpainting algorithm. 

Data: Input video $u$ over $\Omega$, occlusion $\mathcal{H}$, resolution number $L$ 
Result: Inpainted video 

$(u, \theta) \leftarrow$ AlignVideo($u$);                                           // Alg. 2 
$\{u^\ell\}_{\ell=1}^{L} \leftarrow$ ImagePyramid($u$); 
$\{T^\ell\}_{\ell=1}^{L} \leftarrow$ TextureFeaturePyramid($u$);                    // Eqs. 3.5-3.7 
$\{\mathcal{H}^\ell\}_{\ell=1}^{L} \leftarrow$ OcclusionPyramid($\mathcal{H}$); 
$\phi^L \leftarrow$ Random; 
$(u^L, T^L, \phi^L) \leftarrow$ Initialisation($u^L$, $T^L$, $\phi^L$, $\mathcal{H}^L$);   // Alg. 3 
for $\ell = L$ to 1 do 
    $k = 0$, $e = 1$; 
    while $e > 0.1$ and $k < 20$ do 
        $v = u^\ell$; 
        $\phi^\ell \leftarrow$ ANNsearch($u^\ell$, $T^\ell$, $\phi^\ell$, $\mathcal{H}^\ell$);          // Alg. 1 
        $u^\ell \leftarrow$ Reconstruction($u^\ell$, $\phi^\ell$, $\mathcal{H}^\ell$);              // Eqs. 3.2-3.3 
        $T^\ell \leftarrow$ Reconstruction($T^\ell$, $\phi^\ell$, $\mathcal{H}^\ell$); 
        $e = \frac{1}{3|\mathcal{H}^\ell|} \| u^\ell_{\mathcal{H}^\ell} - v_{\mathcal{H}^\ell} \|_2$; 
        $k = k + 1$; 
    end 
    if $\ell = 1$ then 
        $u \leftarrow$ FinalReconstruction($u^1$, $\phi^1$, $\mathcal{H}$);                 // Eq. 3.4 
    else 
        $\phi^{\ell-1} \leftarrow$ UpSample($\phi^\ell$, 2);                           // Sec. 3.6 
        $u^{\ell-1} \leftarrow$ Reconstruction($u^{\ell-1}$, $\phi^{\ell-1}$, $\mathcal{H}^{\ell-1}$); 
        $T^{\ell-1} \leftarrow$ Reconstruction($T^{\ell-1}$, $\phi^{\ell-1}$, $\mathcal{H}^{\ell-1}$); 
    end 
end 
$u \leftarrow$ UnwarpVideo($u$, $\theta$) 

Next, we provide experimental evidence to show the ability of our algorithm to deal with 
various situations which appear frequently in real videos, some of which are not dealt with 
by previous methods. Figure 7 shows an example of the utility of using texture features in 
the inpainting process: without them, the inpainting result is quite clearly unsatisfactory. 
We have not directly compared these results with previous work. However it is quite clear 
that the method of [20] cannot deal with such situations. This method supposes that the 
background is static, and in the case of dynamic textures, it is not possible to restrict the 
search space as proposed in the same method for moving objects. Furthermore, the background 
inpainting algorithm of Granados et al. [19] supposes that moving background undergoes a 
homographic transformation, which is clearly not the case for video textures. By relying on 
a plain colour distance between patches, the algorithm of Wexler et al. is likely to produce 
results similar to the one which may be seen in Figure 7 (middle image). Finally, to take 
another algorithm of the literature, the method of Patwardhan et al. [35] would encounter the 
same problems as that of [19], since they copy-and-paste pixels directly after compensating 
for a locally estimated motion. More examples of videos containing dynamic textures can be 
seen at http://www.telecom-paristech.fr/~gousseau/video_inpainting. 

Our algorithm’s capacity to deal with moving background is illustrated by Figure 8. We 
do this in the same unified framework used for all other examples in this paper, whereas a 
specific algorithm is needed by Granados et al. [19] to achieve this. Thus, we see that the same 
core algorithm (iterative ANN search and reconstruction) can be used in order to deal with a 
series of inpainting tasks and situations. Furthermore, we note that no foreground/background 
segmentation was needed for our algorithm to produce satisfactory results. Finally, we note 
that such situations are not managed using the algorithm of [39]. Again, examples containing 
moving backgrounds can be viewed at the referenced website. 

The generic nature of the proposed approach represents a significant advantage over previous 
methods, and allows us to deal with many different situations without having to resort 
to manual intervention or the creation of specific algorithms. 

4.2. Execution times. One of the goals of our work was to accelerate the video inpainting 
task, since this was previously the greatest barrier to development in this area. Therefore, we 
compare our execution times to those of [39] and [20] in Table 1. 


ANN execution times for all occluded pixels at full resolution: 

Algorithm            | Beach Umbrella | Crossing Ladies | Jumping Girl    | Duo             | Museum 
                     | 264 x 68 x 98  | 170 x 80 x 87   | 300 x 100 x 239 | 960 x 704 x 154 | 1120 x 754 x 200 
Wexler (kdTrees)     | 985 s          | 942 s           | 7877 s          | -               | - 
Ours (3D PatchMatch) | 50 s           | 28 s            | 155 s           | 29 min          | 44 min 

Total execution time: 

Algorithm            | Beach Umbrella | Crossing Ladies | Jumping Girl    | Duo             | Museum 
Granados             | 11 hours       | -               | -               | -               | 90.0 hours 
Ours                 | 24 mins        | 15 mins         | 40 mins         | 5.35 hours      | 6.2 hours 
Ours w/o texture     | 14 mins        | 12 mins         | 35 mins         | 4.07 hours      | 4.0 hours 

Table 1. Partial and total execution times on different examples. The partial inpainting times represent 
the time taken for the ANN search for all occluded patches at the full resolution. Note that for the “Museum” 
example, Granados’s algorithm is parallelised over the different occluded objects and the background, whereas 
ours is not. 


Comparisons with Wexler’s algorithm should be made carefully, since several crucial 
parameters are not specified. In particular, the ANN search scheme used by Wexler et al. 
requires a parameter, $\varepsilon$, which determines the accuracy of the ANNs. More formally, if $W_p$ 
is a source patch, $W_q$ is the exact NN of $W_p$ and $W_r$ is an ANN of $W_p$, then the work of 
[4] guarantees that $d(W_p, W_r) \leq (1 + \varepsilon)\, d(W_p, W_q)$. This parameter has a large influence on 
the computational times. For our comparisons, we set this parameter to 10, which produced 
ANNs with an average error per patch component similar to that of our spatio-temporal PatchMatch. 
Another parameter which is left unspecified by Wexler et al. is the number of iterations 
of ANN search/reconstruction steps per pyramid level. This has a very large influence on 
the total execution time. Therefore, instead of comparing total execution times we simply 
compare the ANN search times, as this step represents the majority of the computational 
load. We obtain a speedup of 20-50 times over the method of [4]. We also include our total 
execution times, to give a general idea of the time taken with respect to the video size. These 
results show a speedup of around an order of magnitude with respect to the semi-automatic 
method of Granados et al. [20]. In Table 1, we have also added our execution times without 
the use of texture features to illustrate the additional computational load which this adds. 
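
The speedup figures quoted above can be checked directly from the entries of Table 1. The short script below only restates that arithmetic; the dictionary names are illustrative, and the assignment of the Granados timings to sequences follows the column order of the table as printed.

```python
# ANN search times in seconds, copied from Table 1.
kd_tree = {"Beach Umbrella": 985, "Crossing Ladies": 942, "Jumping Girl": 7877}
patchmatch = {"Beach Umbrella": 50, "Crossing Ladies": 28, "Jumping Girl": 155}
ann_speedup = {name: kd_tree[name] / patchmatch[name] for name in kd_tree}

# Total execution times in hours, copied from Table 1.
granados = {"Beach Umbrella": 11.0, "Museum": 90.0}
ours = {"Beach Umbrella": 24 / 60, "Museum": 6.2}
total_speedup = {name: granados[name] / ours[name] for name in granados}

for name, ratio in ann_speedup.items():
    print(f"ANN search, {name}: x{ratio:.1f}")
for name, ratio in total_speedup.items():
    print(f"Total time, {name}: x{ratio:.1f}")
```

The ANN search ratios are roughly 20, 34 and 51, which is the 20-50 range quoted above, and the total-time ratios are between one and two orders of magnitude.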

These computation times show that our algorithm is clearly faster than the approaches of 
[39] and [20]. This advantage is significant because not only is the algorithm more practical 
to use, but it is also much easier to experiment and therefore make progress in the domain of 
video inpainting. 

5. Further work. Several points could be improved upon in the present paper. Firstly, 
the case where a moving object is occluded for long periods remains very difficult and is not 
dealt with in a unified manner here. The one solution to this problem (temporal subsampling) 
does not perform well when complex motion is present. Therefore, other solutions could be 
interesting to explore. Secondly, we have observed that using a multi-resolution texture feature 
pyramid produces very interesting results. Therefore, we could perhaps enrich the patch space 
with other features, such as spatio-temporal gradients. Finally, it is acknowledged that videos 
of high resolutions still take quite a long time to process (up to several hours). Further 
acceleration could be achieved by dimensionality-reducing transformations of the patch space. 

6. Conclusion. In this paper, we have proposed a non-local patch-based approach to 
video inpainting which produces good quality results in a wide range of situations, and on 
high definition videos, in a completely automatic manner. Our extension of the PatchMatch 
ANN search scheme to the spatio-temporal case reduces the time complexity of the algorithm, 
so that high definition videos can be processed. We have also introduced a texture feature 
pyramid which ensures that dynamic video textures are correctly inpainted. The case of 
mobile cameras and moving background is dealt with by using a global, affine estimation of 
the dominant motion in each frame. 

The resulting algorithm performs well in a variety of situations, and does not require any 
manual input or segmentation. In particular, the specific problem of inpainting textures in 
videos has been addressed, leading to much more realistic results than other inpainting 
algorithms. Video inpainting has not yet been extensively used, in large part due to prohibitive 
execution times and/or necessary manual input. We have directly addressed this problem in 
the present work. We hope that this algorithm will make video inpainting more accessible 
to a wider community, and help it to become a more common tool in various other domains, 
such as video post-production, restoration and personal video enhancement. 

7. Acknowledgments. The authors would like to express their thanks to Miguel Granados 
for his kind help, for making his video content publicly available and for answering several 
questions concerning his work. The authors also thank Pablo Arias, Vicent Caselles and 
inpainting method. 

Appendix A. On the link between the non-local patch-based and shift map-based 
formulations. 
In the inpainting literature, two of the main approaches which optimise a global objective 
functional are non-local patch-based methods [3, 39], which are closely linked to Non-Local 
Means denoising [10], and shift map-based formulations such as [20, 29, 36]. We will 
show here that the two formulations are closely linked, and in particular that the shift map 
formulation is a specific case of the very general formulation of Arias et al. [2, 3]. 

As far as possible we keep the notation introduced previously. In particular $p$ is a position 
in $\mathcal{H}$, $q$ is a position in $\mathcal{N}_p$ and $r$ is a position in $\mathcal{D}$. In its most general version, the variational 
formulation of Arias et al. is not based on a one-to-one shift map, but rather on a positive 
weight function $w : D \times D \to \mathbb{R}^+$ that is constrained by $\sum_r w(p,r) = 1$ to be a probability 
distribution for each point $p$. Inpainting is cast as the minimisation of the following energy 
functional (that we rewrite here in its discretised version): 

(A.1) $E_{\mathrm{arias}}(u, w) = \sum_{p \in \mathcal{H}} \left[ \sum_{r \in \mathcal{D}} w(p,r)\, d^2(W_p, W_r) + \gamma \sum_{r \in \mathcal{D}} w(p,r) \log w(p,r) \right]$, 

with $\gamma$ a positive parameter. The first term is a “soft assignment” version of (2.1), while the 
second term is a regulariser that favours large entropy weight maps. 

Arias et al. propose the following patch distance: 

(A.2) $d^2(W_p, W_r) = \sum_{q \in \mathcal{N}_p} g_a(p - q)\, \rho[u(q) - u(r + (p - q))]$, 

where $g_a$ is the centred Gaussian with standard deviation $a$ (also called the intra-patch weight 
function), and $\rho$ is a squared norm. This is a very general and flexible formulation. 

Arias et al. optimise this functional using an alternate minimisation over $u$ and $w$, and 
derive solutions for various patch distances. Let us choose the $\ell^2$ distance for $d^2(W_p, W_r)$, as 
is the case in many inpainting formulations (and in particular the one which we use in our 
work). In this case, the minimisation scheme leads to the following expressions: 

(A.3) $w(p,r) = \frac{1}{Z_p} \exp\left( -\frac{d^2(W_p, W_r)}{\gamma} \right)$, 

(A.4) $u(p) = \sum_{q \in \mathcal{N}_p} g_a(p - q) \left( \sum_{r \in \mathcal{D}} w(q,r)\, u(r + (p - q)) \right)$. 


The parameter $\gamma$ controls the selectivity of the weighting function $w$. Let us consider the 
case where each weight function $w(p, \cdot)$ is equal to a single Dirac centred at a single match 
$p + \phi(p)$. If, in addition, we consider that the intra-patch weighting is uniform, in other words 
$a \to \infty$, the cost function reduces to: 

(A.5) $E_{\mathrm{arias}}(u, \phi) = \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{N}_p} \|u(q) - u(q + \phi(p))\|_2^2$, 

which is the formulation of Wexler et al. [39]. 
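
The effect of the selectivity parameter can be illustrated numerically. The sketch below assumes that the weights are proportional to the exponential of minus the squared patch distance divided by the selectivity parameter, and normalised to sum to one; the distances are made-up values and `soft_weights` is a hypothetical name. As the parameter decreases, the weights concentrate on the best match, which is the single-Dirac case considered above.

```python
import math


def soft_weights(distances, gamma):
    """Weights proportional to exp(-d^2 / gamma), normalised to sum to 1."""
    d_min = min(distances)   # subtracting the minimum avoids underflow
    unnorm = [math.exp(-(d - d_min) / gamma) for d in distances]
    z = sum(unnorm)
    return [v / z for v in unnorm]


d2 = [4.0, 1.0, 9.0]         # squared patch distances to three candidates
for gamma in (100.0, 1.0, 0.01):
    print(gamma, [round(w, 3) for w in soft_weights(d2, gamma)])
```

For a large value of the parameter the three weights are nearly uniform, whereas for a small value almost all of the mass is placed on the second candidate, which has the smallest distance.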

Rewriting (A.4) in the particular case just described yields the optimal inpainted image 

(A.6) $u(p) = \frac{1}{|\mathcal{N}_p|} \sum_{q \in \mathcal{N}_p} u(p + \phi(q))$, 

as the (aggregated) average of the examples indicated by the NNs of the patches which contain 
$p$. Suppose that each pixel is reconstructed using the NN of the patch centred on it: 

(A.7) $u(p) = u(p + \phi(p)), \quad \forall p \in \mathcal{H}$, 

as is the case in the shift map-based formulations. Then the functional becomes²: 

(A.8) $E_{\mathrm{arias}}(u, \phi) = \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{N}_p} \|u(q + \phi(q)) - u(q + \phi(p))\|_2^2$, 

which effectively depends only upon $\phi$. 

If we look at the shift map formulation proposed by Pritch et al., we find the following 
cost function over the shift map only: 

(A.9) $E_{\mathrm{pritch}}(\phi) = \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{N}_p} \left( \|u(q) - u(q + \phi(p))\|_2^2 + \|\nabla u(q) - \nabla u(q + \phi(p))\|_2^2 \right)$. 

Let us consider the first part, concerning the image colour values. Since we have $u(p) = 
u(p + \phi(p))$, we obtain again: 

(A.10) $E_{\mathrm{pritch}}(\phi) = \sum_{p \in \mathcal{H}} \sum_{q \in \mathcal{N}_p} \|u(q + \phi(q)) - u(q + \phi(p))\|_2^2$. 

Thus, the shift map cost function may be seen as a special case of the non-local patch-based 
formulation of Arias et al. under the following conditions: 

• $d^2(W_p, W_r) = \sum_{q \in \mathcal{N}_p} \|u(q) - u(r + (p - q))\|_2^2$; 

• $\gamma = 0$; 

• The intra-patch weighting function $g_a$ is uniform; 

• $u(p) = u(p + \phi(p))$, that is, $u(p)$ is reconstructed using its correspondent according to 
a single shift map. 

Arias et al. [2] have shown the existence of optimal correspondence maps for a relaxed 
version of Wexler’s energy (Equation A.1), and also that the minima of Equation A.1 converge 
to the minima of Equation A.5 as $\gamma \to 0$. Such results also highlight the link between different 
inpainting energies. 

However, one should keep in mind that the two formulations which we have considered 
(those of Arias et al. and Pritch et al.) are certainly not equivalent, for reasons such as the 
difference in optimisation methodology and the presence of a gradient term in the formulation 
of Pritch et al. Furthermore, the choice of the reconstruction $u(p) = u(p + \phi(p))$ was not 
considered by Arias et al., meaning that results may differ. 

² We note that using the reconstruction of Equation A.7 poses problems on the occlusion border, but we 
ignore this here for the sake of simplicity and clarity. 

Appendix B. Comparing textured patches. 

In this Appendix, we look in further detail at the reasons why textures may pose a problem 
when comparing patches for the purposes of inpainting. Liu and Caselles noted in [29] that 
the subsampling necessary for the use of multi-resolution pyramids inevitably entails a loss 
of detail, leading to difficulties in correctly identifying textures. In fact, we found that this 
difficulty may occur at all the pyramid levels in images and videos. Roughly speaking, we 
observed that textured patches are quite likely to be matched with smooth ones. The following 
simple computations quantify this phenomenon. 

Patch size  | 3 x 3     | 5 x 5 | 3 x 3 x 3 | 7 x 7       | 9 x 9       | 11 x 11   | 5 x 5 x 5 
Probability | 8 x 10^-2 | 10^-2 | 6 x 10^-3 | 4.1 x 10^-4 | 5.5 x 10^-6 | 3 x 10^-7 | 2 x 10^-7 

Table 2. Probability of producing a random 2D or 3D patch that is closer to a random reference 
patch than to a constant one with the same mean value. Values are obtained through numerical simulations 
averaged over ten runs for each experiment. Components of random patches are i.i.d. according to the centred 
normal law with a grey level variance of 25. 

B.1. Comparing patches with the classical $\ell^2$ distance. The first reason concerns the 
patch distance. Let us consider a white noise patch, $W$, which is a vector of i.i.d. random 
variables $W_1, \ldots, W_N$, where $N$ is the number of components in the patch (number of pixels 
for grey-level patches), and the distribution of all the $W_i$ is $f_W$. Let $\mu$ and $\sigma^2$ be, respectively, 
the average and variance of $f_W$. Let us consider another random patch $V$ following the same 
distribution, and the constant patch $Z$, composed of $Z_i = \mu$, $i = 1, \ldots, N$. 

In this simple situation, we see that $E[\|W - V\|_2^2] = 2E[\|W - Z\|_2^2]$. Therefore, on average, 
the sum-of-squared-differences (SSD) between two patches of the same distribution is twice 
as great as the SSD between a randomly distributed patch and a constant patch of value $\mu$. 
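
The factor of two follows from the independence of the patch components. A short derivation, using the mean and variance of the component distribution introduced above and the fact that the components of the two random patches are independent with the same mean:

```latex
\begin{align*}
E\big[\|W - V\|_2^2\big]
  &= \sum_{i=1}^{N} E\big[(W_i - V_i)^2\big]
   = \sum_{i=1}^{N} \big(\operatorname{Var}(W_i) + \operatorname{Var}(V_i)\big)
   = 2N\sigma^2, \\
E\big[\|W - Z\|_2^2\big]
  &= \sum_{i=1}^{N} E\big[(W_i - \mu)^2\big]
   = N\sigma^2 .
\end{align*}
```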

The previous remark is only valid on average between three patches W, V and Z. In 
reality, we have many random patches V to choose from, and it is sufficient that one of these 
be better than Z for the least patch distance to identify a “textured” patch. Therefore, a more 
interesting question is the following. Given a white noise patch $W$, what is the probability 
that the patch $V$ will be better than the constant patch $Z$? This is slightly more involved, 
and we shall limit ourselves to the case where W and V consist of i.i.d. pixels with a normal 
distribution. 

The SSD patch distance between $W$ and $V$ follows a chi-square distribution $\chi^2(0, 2\sigma^2)$, 
and that between $W$ and $Z$ follows $\chi^2(0, \sigma^2)$. With this, we may numerically compute the 
probability of a random patch being better than a constant one. Since the chi-squared law is 
tabulated, it is much the same thing to use numerical simulations. 

In Table 2, we show the corresponding numerical values for both 2D (image) and 3D 
(video) patches. It may be seen that for a patch of size 9 x 9, there is very little chance of 
finding a better patch than the constant patch. In the video case, we see that in the case of 
5 x 5 x 5 patches, there is a $2 \times 10^{-7}$ probability of creating a better patch randomly. This 
corresponds to needing an area of 170 x 170 x 170 pixels in a video in order to produce on 
average one better random patch. While this is possible, especially in higher-definition videos, 
it remains unlikely for many situations. 
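
A numerical simulation of this kind can be sketched as follows. This is an illustrative Monte Carlo estimate rather than the exact experimental protocol: the number of trials, the seed and the function name are assumptions, the mean is taken as zero, and only the 3 x 3 case is estimated because the larger patch sizes would require far more trials to observe even one favourable event. The last line restates the volume computation: a probability of 2 x 10^-7 corresponds to one favourable patch among 5 x 10^6 candidates on average, that is, a cube of side roughly 170.

```python
import random


def prob_random_beats_constant(n_components, trials=50000, sigma=5.0, seed=0):
    """Estimate P(||W - V||^2 < ||W - Z||^2) for i.i.d. Gaussian patches W, V
    and the constant patch Z equal to their mean (taken as 0 here)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        w = [rng.gauss(0.0, sigma) for _ in range(n_components)]
        v = [rng.gauss(0.0, sigma) for _ in range(n_components)]
        d_random = sum((a - b) ** 2 for a, b in zip(w, v))
        d_constant = sum(a ** 2 for a in w)
        wins += d_random < d_constant
    return wins / trials


p_3x3 = prob_random_beats_constant(9)       # 3 x 3 patch, variance 25

# Side length of the cube holding 1 / (2e-7) candidate patches.
side = (1.0 / 2e-7) ** (1.0 / 3.0)
```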

The question naturally arises of why the problem of comparing textures has not been 
more discussed in the patch-based inpainting literature. Indeed, to the best of our knowledge, 
only Bugeau et al. [11] and Liu and Caselles [29] have clearly identified this problem in 
the case of image inpainting. This is due to the fact that most other inpainting algorithms 
restrict the ANN search space to a local neighbourhood around the occlusion. Unfortunately, 
this restriction principle does not hold in video inpainting since the information can be found 
anywhere in the video volume, in particular when a complex movement must be reconstructed. 

Figure 11. A toy example of the utility of the texture features. With them, we are able to 
distinguish between white noise (right) and the constant area (left), and thus recreate the noise. 

B.2. ANN search with PatchMatch. We have indicated that the $\ell^2$ grey-level/colour 
patch distance is problematic for inpainting in the presence of textures. Additionally, this 
problem is exacerbated by the use of PatchMatch. Indeed, the values of $\phi$ which lead to 
textures are not piecewise constant, and are therefore not propagated through $\phi$ during the 
PatchMatch algorithm. On the other hand, smooth patches represent on average a good 
compromise as ANNs, and the shifts which lead to them are piecewise constant and therefore 
well propagated throughout $\phi$. Another problem which may lead to smooth patches being 
used is the weighted average reconstruction scheme. This can lead to blurry results which in 
turn means that smooth patches are identified. 

One solution to these problems is the use of our texture feature pyramid (§3.3). This 
pyramid is inpainted simultaneously with the colour video pyramid, and thus helps to guide 
the algorithm in the choice of which patches to use for inpainting. 

Figure 11 shows an interesting situation: we wish to inpaint a region which contains white 
noise. This toy example serves as an illustration of the appeal of our texture features. Indeed, 
it is quite clear that without them, there is no chance of inpainting the occlusion in a manner 
which would seem “natural” to a human observer, whereas with them it is possible, in effect, 
to “inpaint noise”. 